Linear Models for Regression

We define observation $\mathbf{x} \equiv (x_{1},\ldots,x_{p})^\intercal$ with $p$ attributes and target value $y$. Given a training set $\mathcal{D}$ with

  • $n$ observations $\mathbf{x}_i \equiv (x_{i1},\ldots,x_{ip})^\intercal$, $\forall i \in N = \{1, \ldots, n\}$, and
  • target values $\mathbf{y} \equiv (y_1,\ldots,y_n)^\intercal$

We shall fit the data using a linear model \begin{align} f(\mathbf{x}, \mathbf{w}) = w_0 + w_1 \phi_1(\mathbf{x}) + w_2 \phi_2(\mathbf{x}) + \cdots + w_{m-1} \phi_{m-1}(\mathbf{x}) \nonumber \end{align} where each $\phi_j$ is a transformation of $\mathbf{x}$ called a basis function.

In matrix form, the fitted values over the training set are \begin{align*} (f(\mathbf{x}_1, \mathbf{w}), \ldots, f(\mathbf{x}_n, \mathbf{w}))^\intercal &= \mathbf{\Phi} \mathbf{w} \end{align*} where \begin{align*} \mathbf{w} &= (w_0, \ldots, w_{m-1})^\intercal \end{align*} and, writing $\phi(\mathbf{x}) \equiv (\phi_0(\mathbf{x}), \ldots, \phi_{m-1}(\mathbf{x}))^\intercal$ with $\phi_0(\mathbf{x}) \equiv 1$, \begin{align*} \mathbf{\Phi} &= \left( \begin{array}{c} \phi(\mathbf{x}_1)^\intercal \\ \phi(\mathbf{x}_2)^\intercal \\ \vdots \\ \phi(\mathbf{x}_n)^\intercal \end{array} \right) = \left( \begin{array}{ccccc} 1 & \phi_1(\mathbf{x}_1) & \phi_2(\mathbf{x}_1) & \ldots & \phi_{m-1}(\mathbf{x}_1) \\ 1 & \phi_1(\mathbf{x}_2) & \phi_2(\mathbf{x}_2) & \ldots & \phi_{m-1}(\mathbf{x}_2) \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & \phi_1(\mathbf{x}_n) & \phi_2(\mathbf{x}_n) & \ldots & \phi_{m-1}(\mathbf{x}_n) \end{array} \right)_{n \times m} \end{align*}

  • When $\phi_j(\mathbf{x}) = x_j$, we have linear regression with input matrix \begin{align} \mathbf{\Phi} = \mathbf{X}_{n \times p} = (\mathbf{x}_1, \ldots, \mathbf{x}_n)^\intercal. \nonumber \end{align}

  • When $p=1$ and $\phi_j(x) = x^j$, we have polynomial fitting.

Note that the matrix $\mathbf{X}$ should be of size $n \times (p+1)$ because of the regression constant. To simplify notation, we assume $x_{i1} = 1$ $\forall i \in N$ and hence ignore the notational inconvenience of dimension $p+1$.
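As a concrete illustration of the polynomial case above, the design matrix $\mathbf{\Phi}$ with columns $\phi_j(x) = x^j$ can be built directly with NumPy (a minimal sketch; the toy data and the choice $m = 3$ are assumptions for illustration):

```python
import numpy as np

# Toy data: n = 5 observations of a single attribute (p = 1).
x = np.array([0.0, 0.5, 1.0, 1.5, 2.0])

# Polynomial basis phi_j(x) = x**j for j = 0, ..., m-1; the first
# column phi_0 = 1 carries the regression constant.
m = 3
Phi = np.vander(x, N=m, increasing=True)  # shape (n, m) = (5, 3)
print(Phi)
```

With `increasing=True`, `np.vander` orders the columns as $x^0, x^1, \ldots, x^{m-1}$, matching the layout of $\mathbf{\Phi}$ above.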

In a frequentist setting,

  • $\mathbf{w}$ is considered to be a fixed parameter, whose value is determined by some form of "estimator", and
  • errors on this estimate are obtained by considering the distribution of possible observed data sets $\mathcal{D}$.

In a Bayesian setting,

  • We assume a prior probability distribution $p(\mathbf{w})$ before observing the data.
  • The effect of the observed data $\mathcal{D}$ is expressed through the likelihood function $p(\mathcal{D}|\mathbf{w})$.
  • Bayes' theorem \begin{align*} p(\mathbf{w}|\mathcal{D}) &= \frac{p(\mathcal{D}|\mathbf{w}) p(\mathbf{w})}{p(\mathcal{D})},\ \text{i.e., posterior} \propto \text{likelihood $\times$ prior} \end{align*}

evaluates the uncertainty in $\mathbf{w}$ after $\mathcal{D}$ is observed, where

  • $p(\mathbf{w}|\mathcal{D})$ is the posterior probability after $\mathcal{D}$ is observed,
  • $p(\mathcal{D})$ is the evidence (marginal likelihood), which normalizes the posterior.
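The proportionality posterior $\propto$ likelihood $\times$ prior can be made tangible by evaluating both factors on a grid of parameter values for a one-weight toy model $y = w x + \text{noise}$ (a sketch only; the data, the $N(0,1)$ prior, and the noise level $\sigma = 0.5$ are assumptions, not part of the text):

```python
import numpy as np

# Toy data for a single-weight model y = w*x + noise.
x = np.array([1.0, 2.0, 3.0])
y = np.array([0.9, 2.1, 2.9])
sigma = 0.5  # assumed known noise standard deviation

w_grid = np.linspace(-3, 3, 601)
prior = np.exp(-0.5 * w_grid**2)                   # p(w), here N(0, 1)

# Gaussian likelihood p(D|w), evaluated for every grid value of w.
resid = y[None, :] - w_grid[:, None] * x[None, :]
likelihood = np.exp(-0.5 * (resid**2).sum(axis=1) / sigma**2)

# Posterior: multiply, then divide by p(D) (the normalizing constant).
posterior = prior * likelihood
posterior /= posterior.sum() * (w_grid[1] - w_grid[0])
print(w_grid[np.argmax(posterior)])  # posterior mode, near 0.97
```

The prior pulls the mode slightly toward zero relative to the pure least-squares fit, illustrating how the prior and the likelihood jointly determine the posterior.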

Least Squares

We minimize the residual sum of squares (RSS) \begin{align*} \text{RSS}(\mathbf{w}) = \sum_{i=1}^{n} (y_i - f(\mathbf{x}_i, \mathbf{w}))^2 = (\mathbf{y} - \mathbf{\Phi} \mathbf{w})^\intercal(\mathbf{y} - \mathbf{\Phi}\mathbf{w}) \end{align*}

The first-order condition gives \begin{align*} \mathbf{\Phi}^\intercal (\mathbf{y} - \mathbf{\Phi}\mathbf{w}) = \mathbf{0} \end{align*}

and solving these normal equations for $\mathbf{w}$ leads to \begin{align*} \mathbf{\hat{w}} = (\mathbf{\Phi}^\intercal\mathbf{\Phi})^{-1} \mathbf{\Phi}^\intercal \mathbf{y} = \mathbf{\Phi}^\dagger \mathbf{y} \end{align*}

where $\mathbf{\Phi}^\dagger$ is the Moore-Penrose pseudo-inverse of the matrix $\mathbf{\Phi}$.
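The two expressions for $\mathbf{\hat{w}}$ can be checked numerically: solving the normal equations and applying the pseudo-inverse give the same coefficients (a sketch on simulated data; the true weights and noise scale are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 50, 3
# Design matrix with a constant column, as in the text.
Phi = np.column_stack([np.ones(n), rng.normal(size=(n, m - 1))])
w_true = np.array([1.0, -2.0, 0.5])
y = Phi @ w_true + 0.1 * rng.normal(size=n)

# Normal-equation solution: solve (Phi^T Phi) w = Phi^T y.
w_normal = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)
# Pseudo-inverse solution: w_hat = Phi^+ y.
w_pinv = np.linalg.pinv(Phi) @ y
print(np.allclose(w_normal, w_pinv))  # True
```

In practice `np.linalg.lstsq` (or a QR/SVD routine) is preferred over forming $\mathbf{\Phi}^\intercal\mathbf{\Phi}$ explicitly, since the latter squares the condition number.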

Suppose $y_i = f(\mathbf{x}_i, \mathbf{w}) + \epsilon_i$, where the errors $\epsilon_i$ have mean zero, are uncorrelated with constant variance $\sigma^2$, and $\mathbf{x}_i$ are fixed. We have

Theorem

$E(\mathbf{\hat{w}}) = \mathbf{w}$ and $\text{Var}(\mathbf{\hat{w}}) = (\mathbf{\Phi}^\intercal \mathbf{\Phi})^{-1} \sigma^2$.

Proof

$\mathbf{\hat{w}}$ is an unbiased estimate of $\mathbf{w}$:

\begin{align*} \text{E}(\mathbf{\hat{w}}) & = \text{E}((\mathbf{\Phi}^\intercal \mathbf{\Phi})^{-1} \mathbf{\Phi}^\intercal \mathbf{y}) = (\mathbf{\Phi}^\intercal \mathbf{\Phi})^{-1} \mathbf{\Phi}^\intercal \text{E}(\mathbf{y}) = (\mathbf{\Phi}^\intercal \mathbf{\Phi})^{-1} \mathbf{\Phi}^\intercal \mathbf{\Phi} \mathbf{w} = \mathbf{w} \end{align*}

Because

\begin{align*} \mathbf{\hat{w}} - \text{E}(\mathbf{\hat{w}}) & = (\mathbf{\Phi}^\intercal \mathbf{\Phi})^{-1} \mathbf{\Phi}^\intercal \mathbf{y} - (\mathbf{\Phi}^\intercal \mathbf{\Phi})^{-1} \mathbf{\Phi}^\intercal \mathbf{\Phi} \mathbf{w} = (\mathbf{\Phi}^\intercal \mathbf{\Phi})^{-1} \mathbf{\Phi}^\intercal (\mathbf{y} - \mathbf{\Phi} \mathbf{w}) \\ &\equiv (\mathbf{\Phi}^\intercal \mathbf{\Phi})^{-1} \mathbf{\Phi}^\intercal \epsilon \\ \text{E}(\epsilon \epsilon^\intercal) &= \text{Var}(\epsilon) = \sigma^2 \mathbf{I}_{n}, \end{align*}

the variance is

\begin{align*} \text{Var}(\mathbf{\hat{w}}) &= \text{E}((\mathbf{\hat{w}} - \text{E}(\mathbf{\hat{w}}))(\mathbf{\hat{w}} - \text{E}(\mathbf{\hat{w}}))^\intercal) = (\mathbf{\Phi}^\intercal \mathbf{\Phi})^{-1} \mathbf{\Phi}^\intercal \text{E}(\epsilon \epsilon^\intercal) \mathbf{\Phi} (\mathbf{\Phi}^\intercal \mathbf{\Phi})^{-1} \\ &= (\mathbf{\Phi}^\intercal \mathbf{\Phi})^{-1} \mathbf{\Phi}^\intercal \mathbf{\Phi} (\mathbf{\Phi}^\intercal \mathbf{\Phi})^{-1} \sigma^2 = (\mathbf{\Phi}^\intercal \mathbf{\Phi})^{-1} \sigma^2 \end{align*}
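The theorem can also be verified by Monte Carlo: drawing many data sets $\mathcal{D}$ from a fixed design and averaging the resulting estimates $\mathbf{\hat{w}}$ (a sketch; the particular design, true weights, and $\sigma = 0.3$ are assumptions made for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, sigma = 30, 2, 0.3
# Fixed design: constant column plus a single attribute.
Phi = np.column_stack([np.ones(n), np.linspace(0, 1, n)])
w = np.array([2.0, -1.0])

trials = 20000
eps = sigma * rng.normal(size=(trials, n))
Y = Phi @ w + eps                   # each row is one data set D
W_hat = Y @ np.linalg.pinv(Phi).T   # w_hat = Phi^+ y for every trial

print(W_hat.mean(axis=0))           # approx. w        (unbiasedness)
print(np.cov(W_hat.T))              # approx. (Phi^T Phi)^{-1} sigma^2
```

The sample mean of the estimates approaches $\mathbf{w}$ and their sample covariance approaches $(\mathbf{\Phi}^\intercal \mathbf{\Phi})^{-1} \sigma^2$ as the number of trials grows, matching the theorem.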